KMID : 1137820220430020109
Journal of Biomedical Engineering Research (의공학회지)
2022, Vol. 43, No. 2, pp. 109-115
Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification
Yoo Sung-Lim

Abstract
Purpose: Classifying medical records with vectorization techniques is an important application of natural language processing. The purpose of this study was to investigate which vectorization techniques are suitable for electronic medical records classification.

Material and methods: A total of 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three vectorization techniques (TF-IDF, latent semantic analysis (LSA), and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three vectorization techniques.
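
The classification step described above can be illustrated with a minimal, dependency-free sketch. It is not the study's code: the study used Scikit-learn, whereas this example reimplements TF-IDF weighting (with the smoothed IDF formula that Scikit-learn's `TfidfVectorizer` uses by default) and cosine similarity in plain Python. The toy documents, the labels, the helper names (`tfidf_vectors`, `cosine`, `nearest_label`) and the nearest-neighbour decision rule are all assumptions made for illustration, since the abstract does not state how a diagnosis was assigned from the similarity scores.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_label(i, vectors, labels):
    """Label of the document most similar to document i (excluding i itself)."""
    others = (j for j in range(len(vectors)) if j != i)
    best = max(others, key=lambda j: cosine(vectors[i], vectors[j]))
    return labels[best]


# Hypothetical toy corpus; these are NOT records from the study.
docs = [
    "chest pain shortness of breath".split(),
    "chest pain radiating to left arm".split(),
    "cough fever sputum".split(),
    "fever cough chills".split(),
]
labels = ["cardiac", "cardiac", "respiratory", "respiratory"]

vectors = tfidf_vectors(docs)
predicted = [nearest_label(i, vectors, labels) for i in range(len(docs))]
# Fraction of documents whose nearest neighbour shares their label.
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
```

Swapping `tfidf_vectors` for an LSA or Word2Vec representation while keeping the same similarity and decision rule is one way the three techniques could be compared on equal terms.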

Results: The 403 medical documents were relevant to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation = 3.46). The classification precisions of the three vectorization techniques were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques.
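
For readers unfamiliar with the Kruskal-Wallis test, the following is a minimal sketch of how the H statistic and its p-value can be computed for three groups. The scores below are invented for illustration and are not the study's data, and the function names are hypothetical. The p-value uses the chi-square approximation with 2 degrees of freedom, for which the survival function has the closed form exp(-H/2), so the sketch applies to exactly three groups.

```python
import math


def average_ranks(values):
    """Return 1-based ranks (ties get the average rank) and the tie term sum(t^3 - t)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_term = 0
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def kruskal_wallis_3(a, b, c):
    """Kruskal-Wallis H (tie-corrected) and p-value for exactly three groups."""
    groups = [list(a), list(b), list(c)]
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks, tie_term = average_ranks(pooled)
    h = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h /= correction
    # Chi-square survival function with 2 degrees of freedom.
    p = math.exp(-h / 2)
    return h, p


# Hypothetical per-fold precision scores, one list per technique (not study data).
tfidf_scores = [0.70, 0.75, 0.78]
lsa_scores = [0.85, 0.88, 0.90]
w2v_scores = [0.79, 0.80, 0.82]

h_stat, p_value = kruskal_wallis_3(tfidf_scores, lsa_scores, w2v_scores)
```

A small p-value only indicates that at least one group differs; it does not say which one, so a post hoc pairwise comparison would be needed to single out a technique.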

Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
KEYWORD
Natural language processing, Medical records classification, Vectorization techniques, Machine learning, Latent semantic analysis
Listed journal information
Korea Research Foundation (KCI), Korean Academy of Medical Sciences member